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Abstract 

We introduce a preconditioner based on a hierarchical low-rank compression scheme of Schur 
complements. The construction is inspired by standard nested dissection, and relies on 
the assumption that the Schur complements can be approximated, to high precision, by 
Hierarchically-Semi-Separable matrices. We build the preconditioner as an approximate 
LDM^ factorization of a given matrix A, and no knowledge of A in assembled form is 
required by the construction. The LDM^ factorization is amenable to fast inversion, and 
the action of the inverse can be determined fast as well. We investigate the behavior of the 
preconditioner in the context of DG hnite element approximations of elliptic and hyperbolic 
problems. 

Keywords: Preconditioned GMRES, Interpolative Decomposition, Indehnite 
operators 


1. Introduction 


This work rests on the observation that, for a large class of problems, the dense Schur 
complement matrices that arise in the nested dissection method are rank-structured, see, 
e.g., j^, 18|. In the case of well-behaved elliptic problems, this property is a consequence 
of the rapid decay of the underlying Green function. To the contrary, in the case of wave- 
propagation problems, the same argument does not apply, and the reasons behind the rank- 
structure of the Schur complements remain poorly understood. Nevertheless, by exploiting 
this property, approximate matrix decompositions can be constructed cheaply, and turn out 
to be excellent preconditioners. 
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Linear systems that arise from finite element discretizations of wave propagation phenomena 
are typically poorly conditioned and highly indehnite. Consequently, it is both vital and 
challenging to construct effective preconditioners. Standard multigrid methods generally 
fail to carry the oscillations at the wavelength scale onto the coarse grids. Incomplete 
LU-decompositions require to assemble the global matrix, are fairly expensive to compute, 
and still lead to a number of iterations that is frequency-dependent. The lack of effective 
preconditioning techniques has lead to the use of (sparse) direct solvers. Fast methods, such 
as the fast-multipole-method, are conhned to problems for which the governing PDF’s can be 
reformulated as boundary integral equations (BIE’s). The linear systems resulting from the 
discretization of the BIE’s are dense, as opposed to sparse, and often well-conditioned. Over 
the last twenty years, a number of methods has been developed for their efficient solution. 
They are based on a rigorous understanding of the physics of the problem, along with 
sophisticated analytical arguments that rely on asymptotic expansions of special functions. 
As a result, they have limited applicability, e.g., variable coefficients problems are out of 
reach, and their efficient implementation remains quite challenging. 

As of today, because of the lack of robust preconditioners, discretizations of wave propaga¬ 
tion problems have eluded the reach of fast iterative solvers. In this work, we introduce a 
preconditioner whose construction is completely general, is parallel in nature, and hts mod¬ 
ern computational architectures. Results obtained in the context of Discontinuous Garlerkin 
(DG) hnite element approximations show that the behavior of the preconditioner, measured 
as the number of GMRES iterations, is independent of both the mesh size and the order of 
approximation. 

The paper is organized as follows. In Section H] we review the concept of Hierarchically- 
Semi-Separable matrices, and describe how a matrix can be reduced to such form within 
linear complexity. In Section |3] we develop analytical arguments to justify the low-rank 
nature of the Schur complements. The construction of the preconditioner is described in full 
details in Section 01 and numerical examples that substantiate its robustness are reported 
in Section O Finally, in Sectional we draw conclusions from this work, and point towards 
future directions of research. 


2. Rank-Structured Matrices 

In general terms, a matrix is rank-structured if the ranks of its off-diagonal blocks, obtained 
through a recursive partitioning of the index vector, are small when compared to the matrix 
size. A number of rank-structured formats have been described in the literature, e.g., the "H- 
and "H^-matrices introduced in [l3, [n|. We focus on Hierarchically-Semi-Separable (HSS) 
matrices. Their off-diagonal blocks admit low-rank decompositions into factors satisfying 
certain recursion relations that make such matrices inexpensive to store and manipulate. In 
fact, if k is the off-diagonal rank of a square HSS matrix of size iV, then such matrix can be 
applied to a vector and inverted in 0{Nk) and 0{Nk'^) operations, respectively. 
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To make the discussion precise, let ^4 be a square matrix of size N = 2^ and partition its 
index vector I = (1,..., 2^m) recursively through a binary tree with L levels, see Figure 
1 (a) Following a well-establish terminology, we call leaf nodes those nodes that belong to the 


hnest level, namely i = L. Nodes a, r on level i form a sibling pair if they share a common 
ancestor on level I — Matrix A is called an “iS-matrix” or a “semi-separable matrix” if 
there exists an integer k such that, for every sibling pair {a, r} of the tree, the off-diagonal 
blocks y4((T, r)0 and A{T,a) have (numerical) ranks equal to k, see Figure 1(b) 


For each sibling pair {a, r} on level i, the corresponding off-diagonal blocks of a semi- 
separable matrix can be factored as: 


A{a,T) = Uf^A,,r 




where square matrices A^^^r and Ar^a have size k, which is independent of £. Let us define 
the block-diagonal matrices: 

= diagjZlcr = A{a^ a) : r belongs to level Lj 
^tau = cliag{f/*^^' : a belongs to level £} for £ = 1,..., L 

= diagjV))’^^^ : cr belongs to level £} for £ = 1 ,..., L 


and let A^^'^ be comprised of all blocks A^j^r such that {a, r} is a sibling pair on level £. A 
semi-separable matrix can be factorized as follows: 


^ -i"’ (knV + • ■ ■ + Ail -f"' (r^)' + D''-* 


( 1 ) 


A pictorial description of the factorization is shown in Figure |2J 


Semi-separable matrices require to store and manipulate “tall” U- and f^-factors at all 
tree-levels. This undesirable property can be circumvented by imposing one additional 
condition. A semi-separable matrix is said to be Hierarchically-Semi-Separable (HSS) if it 
satisfies: 


^tall 



Tytall 
^ a 






for every a belonging to tree-level £ = 1,... ,L — 1, with descending sibling pair {/r, u}. This 
is essentially a condition about nesting of column and row spaces. By collecting matrices 
Ua- and into block-diagonal matrices 


= diag{t/o- : a belongs to level £} for £ = 1,..., L — 1 

= diag {14 : a belongs to level £} for £ = 1,..., L — 1 


^Throughout the paper, the MATLAB®-like notation T(cr, r) indicates the restriction of A to row-index 
vector cr and column-index vector t. 
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(a) Block indexing of matrix A resulting from partitioning of index vector I. 


1 

2 

L 



(b) Low-rank blocks, indicated in gray, for level £ = 1 (left), as opposed 
to low-rank and full-rank blocks, indicated in black, for leaf-level £ = 2 
(right). 


Figure 1: Rank-structure of matrix A. 


the following recursions are obtained: 

C.2] = di; t/m . = for(-=l,,,,.L-l 

Thus, in the case of an HSS matrix, factorization ([T]) simplihes to: 

.4 = t/S ... c/(‘> i(‘' v <'>’... V™' + ... + C/2 A<‘-> v2>' + r"'-' 

= hfi ■ ■ ■ c/i‘> v'A'. .. + ■ ■ ■ + vAi + 

= hi! (//•''"^’ • ■ ■ C/*'> H-1- i(/-'l) 

+ -4"'')v;2' + n<c 


We remark that the assumption that the (numerical) rank of sibling interactions is constant 
across the entire tree is, in practice, replaced by an assumption of boundedness, see Section 
14.41 and Section [5] for further discussion. 
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Figure 2: Factorization of a semi-separable matrix (top), as opposed to the factorization of a hierarchically- 
semi-separable matrix (bottom), for L = 2. For an HSS matrix, at the non-leaf levels, the “tall” basis 
matrices and are replaced by their “short” counterparts and V^''>. 

Rather than HSS matrices per se, the construction of HSS approximants is of practical 
interest. More precisely, given a matrix A and a tolerance e, we seek an HSS matrix 
such that IIH — H(^®®)|| < e, for some suitable norm || • ||. If H is an arbitrary square matrix 
of size N, a straightforward approach for computing yields a O(fciV^) cost, where k 

is the HSS-rank. In practice, this is prohibitively expensive. Nevertheless, under moderate 
assumptions on A, it is possible to construct at the acceptable cost of 0{Nk‘^), see 
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Thus, provided that k is independent^] of N, A can be reduced to HSS-form within 
linear complexity. Let us recall the exact result. 

Theorem 2.1. Let H be an iV x iV hierarchically-semi-separable matrix that has HSS-rank 
k. Suppose that: 

1. matrix-vector products x e-)■ Ax and x h-)■ A^x can be evaluated at a cost 

2. individual entries of A can be evaluated at a cost Tentry 

An HSS factorization of A can be computed in a time proportional to 

^mult X 2(A: -|- p) -b Trand X N{k “b p) “b Tentry X 2Nk “b Tflop X cNk'^ 

where Tj-and is the time required to generate a random number, Tfiop is the time required to 
perform a floating point operation, p is a small oversampling parameter (typically p = 10), 


^As we shall see in Section [5l this is never the case in practice, and some dependency of k upon N is to 
be expected. 
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and c is a small constant independent of N or k. m 

The construction described in jl^ relies heavily on the nesting of the basis of the off-diagonal 
blocks, and on the compression of such blocks through Interpolative Decompositions (ID), 
see, e.g.. Section 4 and 5 of for a detailed discussion on interpolative decompositions. In 
general terms, if A is an m x n matrix, an ID is a factorization of the form: 

^ _ ^(skel) p 

where is a “skeletonization” of A, namely it is an m x A: matrix constructed by selecting 

k columns of A, and P is a well-behavec0 k x n matrix that, obviously, contains an identity 
of size k. The cost of computing a rank-fc ID of A is 0(mkn\og{n)Y As shown in 0 , this 
cost can be lowered to O (mn log(Z) -|- Ikn log(?7,)), where I is an integer greater than but close 
to k (in applications, I = k + 10 is typical.) The hrst term is the cost of obtaining I samples 
of the range of A via an accelerated fast Fourier transform. Consequently, if we additionally 
assume that A can be applied to a vector within linear complexity, we can drop the hrst 
term, and the cost of computing the ID reduces to 

0[lkn\og{n)) (3) 

This is the cornerstone to the establish the complexity of the algorithm on which rests the 
proof of Theorem 12.11 


3. Analytical Apparatus 

In this section we provide an insight into the rank-structure of the Schur complements that 
arise in the construction of the preconditioner. To make the discussion precise, let us consider 
the boundary value problem: 


Cu = f in D (4a) 

Bu = g on T := cID (4b) 

where P is a linear second order differential operator, and P is a linear boundary condition. 
Whenever the problem is well-posed, there exists a Green’s function G such that: 

u{x)= [ G{x,y) f{y)dy+ [ H{x,y) g{y) dS{y) 

Jn Jr 

where H is related to the Green’s function through the boundary condition B. The right- 
hand-side of the previous equality defines the so-called solution operator for problem (|1|). A 


^In the present context, by “well-behaved”, we refer to the fact that no entry of P has absolute value 
greater than 2. Let us recall that in general it is possible to obtain an ID where the entries of P are bounded 
in absolute valued by 1, although this is an NP-hard problem. 
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rigorous derivation of the previous integral equation rests upon a generalized Green’s formula 
of the second type, while a characterization of G as the solution of a boundary value problem 
involves the notion of formal adjoint operator, see, e.g. Chapter 6.7 of (l^ . 

When £ is a uniformly elliptic operator with smooth coefficients and G has a smooth bound¬ 
ary, then the long-range interactions of G are rank-dehcient. Recently, the case of non 
smooth, L°°-coefficients, and G a bounded Lipschitz domain has been addressed in . The 
authors show that, even in this general setting, the Green’s function G can be approximated 
to high precision by an "H-matrix. In the case of homogeneous boundary conditions, this 
property immediately extends to the solution operator. When non-homogeneous boundary 
conditions are considered, we postulate that the nature of the solution operator is pre¬ 
served. 

Let A be the stiffness matrix arising from a hnite element discretization of problem (jl]). 
Let us suppose that, up to a permutation we shall omit, we can partition A as follows, and 
dehne the (aggregated) submatrices 



fA(% 


A(^h, 

\ 



A = 



71«.. 

A<% 

All’ll 


h 1 9 


V 

A''\ 


A<^>J 




In the partitioning of A, we assume the blocks on the diagonal to be square matrices. We 
proceed to characterize matrix as the discretization of a boundary value problem. 

In the hnite element method, each degree of freedom j of the stiffness matrix corresponds 
to a unique hnite element basis function ipj. We dehne the following subdomains of G via 
the supports of the basis functions: 

= U{supp(pj : j e ind(R('')ii)} , = U{supp(pj : j G ind(R*^^)bb)} 

The notation ind(i?) refers to the (row or column) indices of A that are indices of its 

submatrix B as well. We also dehne ffG) = U When the previous discretization 
scheme is applied to the boundary value problem 


Cu = f 

in 

(5a) 

Bu = g 

on T n 

(5b) 

M = 0 

on int(aG(^) \ T) 

(5c) 


it gives rise to A^^'^ as a stihness matrix. 
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An LDM^ block-factorization of is realized as follows: 


^{k) 


( I \ 
W\,Ay^y-^ i) \ 


\ (I 

J I ) 

M* 


Matrices L, M are Gauss transforms, and the Schur complement is defined as: 

By inverting the factorization, it immediately follows that the bottom-right block of 
coincides with . Since A^^') is the discrete analog of the solution operator of problem 

([5]), we conclude that is the restriction to of the discrete solution operator. 

Consequently, the inverse Schur complement has rank-dehcient off-diagonal blocks. Since 
such class of matrices is closed with respect to inversion, the Schur complement has rank- 
dehcient off-diagonal blocks as well. 

We employ the factorizations of the sub-problems ([5]) to obtain and LDM^ factorization of 
the stiffness matrix A of the original problem (11]) : 




\ 


^(2)^(1) 







V 

^(2) y 



D 


Here LA and MA are accumulated Gauss transforms, and the Schur complements SA are 
dehned as previously. In order to further proceed with the factorization, we manipulate 
Dbr, namely the bottom right block of D, in the following way: 


Dbr — 









\ 

^(1) 

^(1,2) ^ 

1 repartition 

sy\^ 

sy\b 




^(2,1) 

^(2) ^ 



~A^ 


sAy, 








S^^\b 

/ 



/ 

sAy, 


S^hb 


\ 


permute 


A<^-% 

sAy, 




regroup 




~A^ 

SA^b 

~A^ 




V 










i(2,l) 


id,2) 
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Up to a permutation we shall omit, we obtain the hnal factorization: 




A = 








5(0) J 




where the Schur complement is dehned as: 

5 '( 0 ) ^ ^( 0 )^^ _ ^( 2 , 1 ) ^( 0 ) ^( 1 , 2 ) 

If dehne ff;, = U {supp (fj : j G ind(A(°)the previous reasoning allows us to characterize 
as the restriction of the solution operator of problem (j4]) to We conclude that 
5'^°^ ^ and have low-rank off-diagonal blocks. 

Finally, let us discuss the case of C being the Helmholtz operator. Although the interpreta¬ 
tion of the inverse Schur complements as restrictions of the solution operator is preserved, 
the underlying Green’s function not longer exhibits long-range low-rank interactions. How¬ 
ever, in the case of scattering problems that involve relatively thin and elongated structures, 
the Green’s function still exhibits low-rank behavior, see [l^. As remarked in [^, the rank 
is dependent on the boundary condition employed to terminate the domain, which in our 
scenario is Bu = g on T D and only a PML condition guarantees consistently low 

ranks. Let us interpret the sets and G;, as elongated scatterers. Although we are 

not using a PML boundary condition, we conjecture that the observed numerical ranks are 
sufficiently small to yield a cheap and reliable approximation of the solution operator. 


4. Preconditioner Construction 

Our construction relies on a variant of the well-known nested dissection algorithm intro¬ 
duced by George in [^. We extend the work of Gilmann and Martinsson, see j^, developed 
in the context of hnite difference approximations of elliptic PDF’s, to hnite element ap¬ 
proximations of a general differential operator. In order to do so, instead of relying on 
geometrical considerations similar to domain decomposition techniques, we employ a purely 
algebraic, black-box approach that allows us to handle a much more general framework. We 
remark that, although the construction of the preconditioner is fully general, we expect its 
performance to be affected by the nature of the underlying PDF. 

Our approach rests upon a recursive reordering of the degrees of freedom (dof’s) of a hnite 
element stiffness matrix A. The reordering generates a binary tree which, in return, dictates 
a hierarchy of Schur complements. The novelty is that, provided that A is sparse and that all 
Schur complements are to high precision Hierarchically-Semi-Separable, the following facts 
hold true: 

1. an LDM^ factorization of A can be realized within linear complexity; 
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2. the factorization can be inverted within linear complexity; 

3. the inverse can be applied within linear complexity. 

The factorization can be realized with a trivially parallel process, whose cost is dominated 
by the cost of processing the Schur complements on the top tree level. The fact that the 
inverse factorization can be applied fast makes it perfectly suitable to be employed as a 
preconditioner. 

^.1. Matrix Reordering 

We partition the dof’s of A into two boxes, Boxi,Box 2 , and, for each box Boxj, identify 
interior dof’s P and boundary dof’s i?* in the sense of the following connectivity graph: 





/■ 



( 6 ) 


The interpretation is straightforward: distinct boxes are connected to each other, in the 
sense of an algebraic graph, through their boundaries dof’s only. At this level of exposition, 
the most attractive partition is the one that maximizes the number of interior dof’s while 
minimizing the number of boundary dof’s. The construction proceeds by repartitioning each 
box into a pair of sibling boxes, in fact creating a binary tree of boxeQ see Figure [3], and 
by identifying boundary and interior dof’s, in the sense of graph (|6]), for each new pair of 
sibling boxes. For illustration, the connectivity graph of boxes on the second tree level, i.e., 
i = 2, is: 



^Strictly speaking, this is a tree with missing route, i.e., a forest. In fact, if we were to introduce a single 
top-level box holding the entirety of the dof’s, under a purely algebraic approach, no boundary dof’s could 
be identified. 


10 



























^ = l 



1 = 2 


i = L 


Figure 3: Binary tree of boxes for dof’s partition 


The construction terminates at tree level i = L, when the newly created boxes contain a 
number of dof’s sufficiently small to allow for dense linear algebra operations at a negligible 
cost. For consistency with the notation introduced in Section [2l we switch to Greek letters 
and identify a box with its index, i.e., a = BoXg-. 

We establish an ordering for the dof’s in 7°' and and dehne the following sub-matrices 
of A: 


= A(r,r) a-hox 

A^^\b = A{B% B^) a-box 

A^^\i = A{B%r) a-box 

A^^\b = A{r, B^) a-box 

= A{B\B^) cT-box 


interior-to-interior 

boundary-to-boundary 

boundary-to-interior 

interior-to-boundary 

to r-box (boundary interaction only) 


Let (Ti,..., be the leaf boxes and order the dof’s of A by grouping together the interior 
dof’s and the boundary dof’s 7?°"^ • • •! on tree level i = L: 





\ 


aaa.. 





AA^hb ■ 


V 





As common practice, we have omitted the null blocks. 


( 7 ) 


4 . 2 . Hierarchical Matrix Factorization 

The factorization strategy recursively decouples interior dof’s from boundary dof’s through 
Gauss transforms, starting from the leaf-boxes, all the way up to the top tree-level. The 
Schur complements that arise in the process are treated through accelerated linear algebra 
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techniques. More specifically, the complement associated to a box is obtained by a fast 
merge of the complements of its child-boxes. 

For each box a, we define Gauss transforms as the unit-lower-triangular matrices and 
such that: 




= [aA)^^ "A^y^y ( 8 ) 


Since it is well-understood how Gauss transforms accumulate and commute with permuta¬ 
tions, we henceforth refer to them generically as L and M. We employ the Gauss transforms 
to decouple the top-left super-block of A: 



MGi),, 

AAA.. 

\ 

L-^AM-^ = 



SAi) y4(o-i,o-2) 

^(cr2,o-l) 

y4Gi,o-„) 


V 



J^{crn,crn-i) j 


The top-right and bottom-left super-blocks vanish, while Schur complements S^') = AA\f^ — 
AA)^AA),r aG) ib appear on the diagonal of the bottom-right super-block. 

In order to recursively proceed in the factorization, for each pair of sibling boxes {/r, u} with 
common ancestor a, we define the remaining interior dof’s / and the remaining boundary 
dof’s B'^ as: 

r = {B>^ U B'^) OF , B^ = (5^ U B‘') n B'^ 

Gonsequently, is a partition of the aggregated boundary B^ U B^, while {B'^ fl 

1°^, B^ n B^} is a partitioE@ of B^. Up to a permutation we shall omit, we partition the 
Schur complement S'Gl as: 

where the blocks are defined as: 


s(A.. = n n 

n B^, B^^ n B^) 

n 5-^ n F) 

n F, B^^ n B^) 


®The two sets are evidently disjoint. Furthermore: D F) U fl B'^) = B^ fl (7°’ U B°') = B^ n 

{Bf^UB'^) = B^^. 
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Let us define matrices: 











bb 


= Um 



and A^'^\i analogously to In order to simplify the notation, A^^’A indicates a generic 

sub-matrix of A^’A that can be inferred from the context^. By reordering the boundary dof’s 
..., as ..., /(■), bA^ ... ^ bA^ where the omitted superscripts are the parent 
boxes of the leaves ui,..., cr„, equation dH]) becomes: 



where the symbol * indicates an omitted non-zero block. 


The factorization proceeds recursively by exploiting the fact that the bottom-right super¬ 
block has structure that is identical to that of the original matrix A, compare to equation 
(j7]). The interior dof’s fA ^..., IA are recursively decoupled through Gauss transforms until 
the top tree-level is reached. The Schur complement that originates for the elimination of 
the J®" dof’s is dehned as: 


SiA 


AA,A\ 

sA),J 




fSA),, AA^A\ 

\AAA sA),J 




/SM.. iG,G\ 

\AAA sA),J 


fSA),, 

SAAJ 


A(‘’hb 


When the decoupling process terminates, we obtain the following factorization: 


/Aaa. 


\ 


L-^AM-^ = 


\ 


SA) 

AAA) 


aaa 

sA) y 


( 10 ) 


( 11 ) 


The boxes ui,... ,ct„ are ordered starting from the bottom tree-level and, for consistency 
of notation, we have set AAA. = aAA.^qy the leaf-boxes as well. The matrices L and M 
are the accumulated Gauss transform^, relative to all boxes, excluding those on the top 


^Distinct instances of the symbol should be regarded as different matrices. 

^As before, permutations have been omitted in the definition of the Gauss transforms. 
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tree-level: 

L = ; M = 

By setting 

'4'“’ = ('Jm) 5 ( 2 )’) ; r' = diag{il->ii,...,i<'’"'„,i‘»>} ( 12 ) 

we obtain the desired LDM* factorization of A. 

4 . 3 . Fast Merging of Schur Complements 

The hierarchy of the boxes dictates a hierarchy of the Schur complements in the sense that 
depends only upon the Schur complements of the child-boxes of cr, namely and 
S^'^\ and some blocks of the original matrix A, see equation (fTOj) . Informally, we say that 
is obtained by “merging” together and As we are about to show, when 
and are in HSS-form, the merging procedure can be performed fast, in the sense that 
an HSS approximation to can be computed within linear complexity. 

The fast merging procedure is based the following assumptions: 

1. all sub-matrices A^’’’^ of A are sparse; 

2. all leaf-node Schur complements and their inverses ^ are in HSS-form. 

In general, the property of a matrix to be sparse does not carry over to an arbitrarily 
selected sub-matrix. In the context of high-order hnite element approximations, this is 
well-established. In fact, dense blocks allow advanced finite element solvers to employ high- 
performance dense linear algebra. We remark that assumption ([1]) is much milder, since it 
only requires matrices describing boundary-to-boundary sub-interactions to be sparse. 

For ease of exposition, let us recall the definition of as in dOD: 

A<~‘’hb 

Assume that and are in HSS-form. The fast merging procedure is performed 
through the following steps. 

Step 1 Since a sub-matrix of an HSS matrix can be trivially obtained through a tall and 
skinny permutation matrix P, e.g., S^^\b = P, matrices AA\b, AA\^^ and 

ib can be applied to a vector within linear complexity. 

Step 2 The action of the inverse of AP^h on a vector is equivalent to the solution of 
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input : zi, Z2, 
output: xi, X 2 , \ 

— STEP 2.1 — 

determine permutations —)■ fl and P„ ■. S'" ^ B’^ D 

define = P^ P^, and = P* P^; 
compute ^ through fast inversion; 

— STEP 2.2 — 

compress ^ as in Theorem 12.11 

compute S^^\i ^ through fast inversion; 

— STEP 2.3— 

compute X 2 = ^{z 2 — S^^'>a ^ Zi), and Xi = 5'*^^^** ^Zi — S^^\i ^ X 2 

through fast application; 

Algorithm 1: Compression and fast application of A^^^i . 


the linear system: 

\x2) \Z2) 

where z = ( 2 : 1 , ^ 2 ) and x = (xi,X 2 ) have been partitioned according to the 
blocking of A^'^\i. A standard block-solve yields: 

X 2 = S^''\^~\z 2 - zi) (13a) 

xi = z^ - X2 (13b) 

where S^''\i = S^^\i ^ Let us remark that matrices S'*'^A ^ 

and S^’'\i fully describe the action of through equations ffT5]) . Since 

can be applied within linear complexity, by virtue of Theorem 12.11 it can 
be compressed efficiently. The cost of the current step is: 

0(#AA) = 0(#AA) + 0(#AA) + 0(#rA) 

inversion of compression of ii inversion of ii 

The procedure is detailed in Algorithm ([1]). 

Step 3 Under the assumption that #.8°^ = the previous steps imply that S' A) 

can be applied to a vector within linear complexity. By Theorem 12.11 it can be 
reduced to HSS form at the cost 0(#.B'^fc^). 

Let us remark that, although computing the HSS form of a parent Schur complement is a 
linear complexity operation, there is not guarantee that it is HSS to high precision. We ex¬ 
pect this property to be dictated by the nature of the underlying PDE and the discretization 
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input : A, binary tree partitioning the dot’s of A 
output: for all cr’s in the binary tree 

— STEP 1 — 

for a on level L do 

compute by a straightforward approach; 

compress by a straightforward approach; 

end 

— STEP 2 — 

for £ = L — 1 ,..., 1 do 

for a on level i do 
form 

compress as in Algorithm [H 

compress = A^'^\f, — A^'^\i A^'^lny as in Theorem 12.II 

end 

end 

Algorithm 2: Computation of Schur complements through fast merging. 


method, and affect the cost/effectiveness ratio of the preconditioner. This is investigated 
through numerical experiments, that are presented in Section [5l 


4■4- Approximate Matrix Factorization 


Let A have dimension N and select a number of tree levels L so that each leaf box holds a 
sufficiently small number of dof’s m = N/2^ to allow for dense linear algebra manipulations 
at negligible cost. At the level of leaf-boxes, the Schur complements are computed and 
compressed to HSS-form in a straightforward manner at a cost proportional to O(m^). 
Starting from the leaf-boxes, the Schur complements are merged together using the strategy 
described above, until the top tree-level is reached. The procedure is detailed in Algorithm 
[21 The cost of the algorithm is dominated by the cost of processing the boxes on the top 
level, namely cr on level £ = 1. Apart from the leaf nodes, the uncompressed 

Schur complements are never formed explicitly. Since the Gauss transforms L and M are 
obtained from the matrices, the fast merging process described in Algorithm [2] does, 

in fact, produce an approximate LDM^ factorization of A. 

Since the Gauss transforms can be trivially inverted, the cost of inverting the LDM^ factor¬ 
ization is tantamount to the cost of inverting D which, in turn, see equation f[T2l) . coincides 
with the cost of inverting block defined as: 


/ ^( 1 ) 

VA(2T) ^(2) J 
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input : zi, ^ 2 , ^(2,1) 

output: xi, X 2 , 3^“^^ 

— STEP 1 — 

compute 5*'^^ through fast inversion; 

compress S^^'> = — A^‘^’3 5'(i) ^ as in Theorem 12.11 

compute through fast inversion; 

— STEP 2 — _ 

compute X 2 = 5'^^^ ^{z 2 — 5'(i) ^ ^i), and Xi = ^ Zi — 3^^^ ^ X 2 through 

fast application; 

Algorithm 3: Compression and fast application of 


The inversion of A^^'^ can be achieved with a slight modihcation of Algorithm [H described 
in Algorithm |3l We conclude that the cosll^ of building the inverse factorization is 

0{#B^kA = 0(#.B"fc2) + 0{i^B^kA 

LDM^ factorization inversion of 


for a on level 1 =1. 

In order to obtain cost estimates with respect to the problem size A, we evaluate the number 
of dof’s in B^ through a geometrical argument, namely: 

l-l/d 

#5°" = ( ^ I ! box a on level ^ (14) 

for some parameter d. The most conservative estimate is obtained for the limit case d oo, 
which yields: 

N 

= — , for box a on level ^ 

Taking into account that H^B^ < #5°^, the cost of building the inverse factorization is 

0{i^B^kA = 0{NkA (15) 

We conclude that the construction cost of the preconditioner, i.e., the inverse factorization, 
is linear with respect to N, provided that k is independent of N. 

Let us briefly comment on the last statement. The assumption that k is independent of A, 
and solely dictated by the nature of the problem and the discretization, is unrealistic. In 
practice, we are interested in the asymptotic behavior of k, and some mild dependency upon 
A is acceptable, as long as k/N I 0. In this respect, estimate (fT5|) is deceptively optimistic. 


®As previously, we assume #5"^ = 0(#/‘^). 
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On the other hand, taking d 'I' cxo in estimate (IT^ could be excessively conservative, and lead 
to an overly pessimistic result. As numerical experiments presented in Section |5] suggest, 
both of those scenarios are possible and the construction cost is tightly related to strategy 
used to increase the problem size. We defer further discussion to Section [5l 

Finally, we remark that the inverse factorization can be applied to a vector 

within linear complexity, thus making it suitable to be employed as a preconditioner. 


5. Numerical Results 


Discontinuous Galerkin (DG) hnite element approximations provide a natural environment 
for testing the proposed low-rank preconditioner. In fact, due to the doubling of dof’s on 
the elements boundaries, the partitioning can be interpreted in purely geometrical terms, 
and the identihcation of interior and boundary dof’s for each box is easily achieved. We 
investigate the behavior of the preconditioner for the following problems: 


1. Poisson equation; 

2. anisotropic reaction-diffusion equation; 

3. Helmholtz equation; 

4. Helmholtz equation with high material contrast. 


We rely upon Interior Penalty DG formulations and employ nodal DG hnite element dis¬ 
cretizations, see 1^. All problems are formulated on the unit square D = (0,1)^, on which 


we lay a structured triangular grid. We shall investigate the behavior of the preconditioner 
with respect to the domain partitioning scheme, see Figure IU the mesh size h, and the 
polynomial degree of approximation p. For the Helmholtz operator, the wave number k, is 
chosen as a function of the discretization parameters h and p, so that the dispersion error, 
see HI, 3 , remains constant as the size of the problem varies. 


In order to assess the validity of our theoretical framework, we analyze the rank-structure of 
the Schur complements that emerge from discretizations of the Laplace and the Helmholtz 
operators. A qualitative study that compares the Schur complements of the two operators 
is illustrated in Figure |5l For a hxed problem size, we descend the tree focusing on the 
Schur complements relative to the hrst box on each level. At the top level, the Schur 
complement of the Laplace operator exhibits off-diagonal blocks that are rank-dehcient, and 
more so than their counterparts relative to the Helmholtz operator. As we descend the tree, 
the difference between the Schur complements of the two operators becomes less evident. 
Further numerical experiments indicate that this behavior is consistent across problems of 
different sizes. 


We proceed by studying the behavior of the HSS-rank k of the Schur complements relative 
to the Laplace operator as the problem size N is increased through h-re£nements or p- 
enrichments, see Figure [6l The numerical experiments showed a uniformity of behaviors of 
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the Schur complements across all tree-levels. Thus, we limit our presentation to the Schur 
complement relative to the hrst box on the top level. When h-re£nements are performed, 
k is independent of the problem size while, in the case of p-enrichments, k grows roughly 
as the square root of the size n of the Schur complement. In the case of the Helmholtz 
operator, we postulate that k grows as log n in the case of h-refinements, and as in the 
case of p-enrichments. 

Taking into account that estimate ffTT|) reduces to #5°^ = in the case of h- 

rehnements, and to = O(A^) in the case of p-enrichments, we can revisit the cost 

estimate flT^ as follows: 

Laplace operator: 


cost of processing a on level i (h-re£nements) 
cost of processing a on level £ (p-enrichments) 


io(iVV=) 


Helmholtz operator: 

cost of processing a on level £ (h-rehnements) 
cost of processing a on level £ (p-enrichments) 


io(iV‘/Mog2]V) 


(16a) 

(16b) 


(17a) 

(17b) 


The estimates corresponding to the two scenarios, i.e., h-re£nements as opposed to p- 
enrichments, are dramatically different. Apart from the different growth rates of /c, this 
is dictated by the following geometrical argument. Given a fixed box, e.g., a box on the 
top tree-level, when h-refinements are performed, we observe a “thinning” of the boundary, 
namely the number of boundary dof’s grows slower than the total number of dof’s in the 
box. On the other hand, in the case of p-enrichments which corresponds to the limit d t oo, 
the box contains a hxed number of elements, and no “thinning” of the boundary occurs. In 
applications of practical interest, the problem size is increased through adaptive strategies, 
that combine h-re£nements and p-enrichments to produce optimal meshes. Estimates (I16bj) 
and fll7bH should be regarded as the worst-case scenarios, which are not to be encountered 
in practice. 

Although “time” (construction and application) is the ultimate measure of performance of 
a preconditioner, it is spectacularly implementation-dependent, and it makes little sense in 
the context of a non-optimized implementation. We take the following approach. Since the 
Schur complements are compressed by recursively performing ID’s of appropriately select 
sub-blocks, we evaluate the total compression cost through estimate ([3]). More precisely, we 
select the anticipated rank I according to the above discussion, and employ I and the actual 
rank k, as returned by the ID, in the computation of the final estimate. We investigate 
the agreement between such empirical cost and the theoretical estimates (ITBD and (ITT)) . We 
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measure the performance of the preconditioner in terms of GMRES iterations. If I is properly 
selected, as a function of N, we expect the performance of the preconditioner not to degrade 
as the problem size is increased. We compare the performance of our preconditioner to that 
of a standard ILU preconditioner. 

As a hrst example, we consider the Poisson equation on with a homogenous Dirichlet 
boundary condition: 


—Am = / in 

u = 0 on dfl 


As anticipated, the problem is discretized using an Interior Penalty DG formulation. We 


build the preconditioner by recursively partitioning the domain into boxes, see Figure 4(a) 
and study its behavior by comparing it to that of a standard ILU preconditioner. The 
construction cost agrees with the theoretical estimates fflB]) . and the performance, i.e., the 
number of GMRES iterations, is independent of h, and p, see Figure [71 


In order to investigate the role played by the partitioning scheme, we employ a constant co¬ 
efficients reaction/diffusion problem, with an anisotropic diffusion tensor A = diag(aii, 022 ): 


— div(A Vm) 4-CM = / in G 
M = 0 on dQ 


( 18 a) 

( 18 b) 


We construct the diffusion tensor so that it displays a strong anisotropy, i.e., an S> 022 , 
and investigate the problem as it transitions from a diffusion-dominated regime to reaction- 
dominated regime. We dehne Oq = 022 / 011 , Cq = c/on, and /o = //an, and proceed to 
normalize equation fllSal) by an: 


(9^m 



+ Co u 


fo 


in G 


We set Oo -C 1, i.e., the diffusion is much stronger along the x-axis, as compared to the 
y-a.xis, and study the behavior of the preconditioner for cq <C 1, Cq = 1, and Cq 3> 1, see 
Figure m When the problem is not reaction-dominated, i.e., Cq -C 1 or cq = 1, the choice of 
the partitioning scheme is crucial. As in accordance with the nature of the problem, vertical 
slabs work very poorly, while horizontal slabs work excellently. The performance of the box 
partitioning scheme is comparable to that of the vertical slab partitioning, thus it can be 
regarded as a general purpose scheme. As the problem transitions to a reaction-dominated 
regime, the domain partitioning scheme becomes less and less relevant. 
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Finally, we move to indefinite problems. Let us consider the Helmholtz equation on H with 
a homogenous Dirichlet boundary condition: 

— Am — = f in 

u = 0 on dfl 


This boundary value problem describes propagation of forced waves inside a soft cavity. 
Although exterior scattering problems are often of greater interest, the cavity problem is 
computationally more challenging because of possible resonances. Remarkably, the behavior 
of the preconditioner is qualitatively similar to that observed in the case of the Poisson prob¬ 
lem. The construction cost is in agreement with estimates ffT7|l and the performance of the 
preconditioner does not deteriorate as the problem size is increased through h-re£nements 
or p-enrichments, see Figure [H 

As a last example, we consider the following Helmholtz problem: 


— div 



- K^U = f 
M = 0 


in H 
on dVt 


This problem is of interest in, e.g., topology optimization, see j^. In a nutshell, topology 
optimization is an iterative method to create highly optimized designs by determining a 
distribution of material, namely p, that fulhlls a specihc task, in a locally optimal manner. 
Consequently, the density q varies by orders of magnitude over the domain. For ease of 
implementation, we assume it to be constant over each element. We place a rectangular 
enclosure D with density qd inside the domain H, and assume qd 3> Qo, where Qq is the 
density in Q\D, see Figure 10(a) We observe a very mild dependency of the behavior of 
the preconditioner upon the domain partitioning scheme, see Figure 10(b) As expected. 


the horizontal slab partitioning delivers the best performance, which is virtually matched by 
that of the box partitioning. Although the vertical slab partitioning a priori appears as the 
least appropriate one, its performance is not dissimilar from those of the other partitioning 
schemes. 


6. Conclusions 

We have presented the construction of a preconditioner that exploits low-rank compression 
of Schur complements. The construction can be viewed as a variant of the well-known 
nested dissection algorithm, since it employs a reordering of the degrees of freedom which 
allows for an advantageous elimination order. Our approach, originally inspired by the 
work of j^, follows a black-box approach that gives the construction the flexibility to be 
applied to a number of discretization techniques, such as finite differences, CG finite elements 
and DG hnite elements. The preconditioner can be applied within linear complexity, and 
we provide an estimate of the construction cost. Such estimate depends widely upon the 
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(a) Boxes (b) Horizontal slabs (c) Vertical slabs 

Figure 4: Domain partitioning schemes. 


strategy used to vary the problem size (h-refinement as opposed to p-enrichments, in the 
case of hnite elements approximations), with a worst-case-scenario of quadratic growth. 
Although this issue requires further investigation, we believe that for applications of practical 
interest, the construction cost is within linear growth. We tested the performance of the 
preconditioner on DG approximations of elliptic as well as hyperbolic problems, see (l^ . 
and demonstrated its robustness. More specihcally, the preconditioned system is solved 
within a number of GMRES iterations that is independent of both the mesh size and the 
order of approximation. The choice of DG finite elements approximations was dictated 
by implementation convenience, namely the reordering of the degrees of freedom has a 
straightforward geometrical interpretation, and should not be viewed as a limitation of the 
applicability of the preconditioner. In principle, the reordering of the degrees of freedom 
can be performed with any graph-partitioning software. 


The construction of preconditioners for linear systems that arise from wave propagation 
phenomena is notoriously challenging, see 
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The proposed preconditioner is based on 
a completely general construction and has proven effective for a number of problems. In 
terms of future research developments, the most pressing issue is to verify the agreement 
between the actual construction cost and its theoretical estimate through an optimized 
implementation. This will allow us to assess whether the asymptotic region is reached for 
problem sizes of practical relevance. Finally, we should investigate the effectiveness of the 
preconditioner over a larger class of problems, e.g., coupled multi-physics problems, and 
explore the possibility of adapting the construction to time-dependent problems. 
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Figure 5: Rank-structure of Schur complements of the Laplace operator (left column) as opposed to those 
of the Helmholtz operator (right column). From top to bottom, finer and finer tree-levels are considered. 
The colors indicate the relative rank of the corresponding sub-blocks. 
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Figure 6: Rank growth of Schur complements of Laplace operator. Data refer to ranks of largest off-diagonal 
block of complements on the top tree level. The problem size is increased through h-refinements (circles) or 
p-enrichments (squares.) 
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Number of GMRES iterations Problem dimension 


(a) Problem size is increased through h-refinements. 


10'’ 



Number of GMRES iterations 



(b) Problem size is increased through p-enrichments. 

Figure 7: Solution of a Poisson problem through GMRES. Circles refer to the proposed low-rank precondi¬ 
tioner, and squares refer to a standard ILU preconditioner. The performance of the low-rank preconditioner, 
i.e., the number of GMRES iterations, does not deteriorate as the problem size increases. 
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Problem dimension 


(a) Problem size is increased through h-refinements. 




(b) Problem size is increased through p-enrichments. 

Figure 8: Solution of a Helmholtz problem through GMRES. Circles refer to the proposed low-rank precondi¬ 
tioner, and squares refer to a standard ILU preconditioner. The performance of the low-rank preconditioner, 
i.e., the number of GMRES iterations, does not deteriorate as the problem size increases. 
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Figure 9: Solution of a strongly anisotropic, i.e., qq = 10“®, reaction/diffusion problem through GMRES. 
The problem size is increased through /i-refinements. Circles refer to box partitioning of the domain, squares 
refer to horizontal slab partitioning, and diamonds refer to vertical slab partitioning. As cq increases and 
the problem transitions for diffusion-dominated to reaction-dominated, the performance of the low-rank 
preconditioner becomes independent of the partitioning scheme. 
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(a) Domain partitioning schemes (enclosure D is indicated in gray.) 
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(b) Circles refer to box partitioning, squares refer to horizontal 
slab partitioning, diamonds refer to vertical slab partitioning 

Figure 10: Solution of a high-material-contrast Helmholtz problem through GMRES. The material enclosure 
D has non-dimensional density qd = 10^, while the remaining portion of the domain, i.e., V, \ D, has 
density = 1- The problem size is increased through h-refinements. The performance of the low-rank 
preconditioner exhibits a mild dependency on the partitioning scheme. 
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